t-Plausibility: Generalizing Words to Desensitize Text

Authors

  • Balamurugan Anandan
  • Chris Clifton
  • Wei Jiang
  • Mummoorthy Murugesan
  • Pedro Pastrana-Camacho
  • Luo Si
Abstract

De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in the anonymization of structured data, anonymization of textual information is in its infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information (names, addresses, dates), they fail on two counts. The first is that complete redaction may not be necessary to prevent re-identification, and it can harm the readability and usability of the text. More serious is that identifying information, as well as sensitive information, can be quite subtle and still be present in the text even after the obvious identifiers are removed. Observe that a diagnosis of “tuberculosis” is sensitive, but in some situations it can also be identifying. Replacing it with the less sensitive term “infectious disease” also reduces identifiability. That is, instead of simply being removed, sensitive terms can be replaced by more general but semantically related terms, which protects sensitive and identifying information without unnecessarily reducing the amount of information contained in the document. Based on this observation, the main contribution of this paper is to provide a novel information-theoretic approach to text sanitization and to develop efficient heuristics for sanitizing text documents.
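
The replacement idea in the abstract (swap "tuberculosis" for "infectious disease") can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy `HYPERNYM` hierarchy, the function names, and the candidate count used as a rough ambiguity measure. The abstract does not define t-plausibility or the paper's heuristics, so this should not be read as the authors' method.

```python
# Toy hypernym hierarchy: child term -> more general parent term.
# Hypothetical data; a real system would draw on a lexical ontology.
HYPERNYM = {
    "tuberculosis": "infectious disease",
    "influenza": "infectious disease",
    "measles": "infectious disease",
    "infectious disease": "disease",
    "diabetes": "disease",
}


def generalize(term, levels=1):
    """Walk up the hierarchy by at most `levels` steps."""
    for _ in range(levels):
        term = HYPERNYM.get(term, term)  # terms without a parent stay unchanged
    return term


def leaves_under(term):
    """Return the most specific terms that generalize to `term`."""
    children = [child for child, parent in HYPERNYM.items() if parent == term]
    if not children:
        return {term}
    leaves = set()
    for child in children:
        leaves |= leaves_under(child)
    return leaves


def sanitize(tokens, sensitive, levels=1):
    """Generalize only the tokens marked as sensitive."""
    return [generalize(t, levels) if t in sensitive else t for t in tokens]


def count_candidate_originals(sanitized_tokens):
    """Illustrative ambiguity measure (an assumption, not the paper's definition):
    how many fully specific texts could generalize to the sanitized one."""
    count = 1
    for token in sanitized_tokens:
        count *= len(leaves_under(token))
    return count


tokens = ["patient", "diagnosed", "with", "tuberculosis"]
sanitized = sanitize(tokens, sensitive={"tuberculosis"})
print(sanitized)
print(count_candidate_originals(sanitized))
```

Generalizing further up the hierarchy (for example `levels=2`) makes more original texts consistent with the sanitized one, at the cost of conveying less information, which is the trade-off the abstract describes.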

Similar resources

Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of m...

Defending the Authenticity of the Supplications of the Ahl al-Bayt (AS): A Case Study of the Arafa Supplication

The Arafa supplication with its rhythmic text is a prayer in about 3200 words, eloquent in style and with only a few different versions with slight variances. The subject matter of this supplication is to confess to the sublimity and glory of God, before the worldly, tragic and unstable situation of man. In addition, the praying person praises God for his abundant mercies and invokes God's bles...

High-Probability Syntactic Links

In this example, however, by the moment the word "is" has been read, the word "problem" is already engaged in other strongly predicted constructions, namely the prepositional phrase "of this problem" and even the whole noun phrase "the solution of this problem". A conflict arises, and plausibility of the new hypothesis becomes much lower. Such syntactic relations may concern...

Generalizing Automatically Generated Selectional Patterns

Frequency information on co-occurrence patterns can be automatically collected from a syntactically analyzed corpus; this information can then serve as the basis for selectional constraints when analyzing new text from the same domain. This information, however, is necessarily incomplete. We report on measurements of the degree of selectional coverage obtained with different sizes of co...

Learning a Scanning Understanding for "Real-world" Library Categorization

This paper describes, compares, and evaluates three different approaches for learning a semantic classification of library titles: 1) syntactically condensed titles, 2) complete titles, and 3) titles without insignificant words are used for learning the classification in connectionist recurrent plausibility networks. In particular, we demonstrate in this paper that automatically derived feature...

Journal:
  • Trans. Data Privacy

Volume 5

Publication date: 2012